Abstract: We present a method for clustering genomic sequences based on variations in
local entropy. We have analyzed the distributions of the block entropies of viruses and 
plant genomes. A distinct pattern for viruses and plant genomes is observed. These 
distributions, which describe the local entropic variability of the genomes, are used for 
clustering the genomes based on the Jensen-Shannon (JS) distances. The analysis of the JS
distances between all viral genomes that infect the chlorella algae shows the host specificity of
the viruses. We illustrate the efficacy of this entropy-based clustering technique by the 
segregation of plant and virus genomes into separate bins. 

1. Introduction

The organization of genomes has evolved dynamically by a stochastic process comprised of 
mutation and selection. An interesting open problem is to test whether the overall organization of
genomes is subject to evolutionary pressures. In this paper, we examine and compare the local 
sequence entropies of several genomes. The traditional Shannon's entropy is a measure for
disorder, and is defined as $H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$, where X is a random variable
with realizations $\{x_1, x_2, \ldots, x_n\}$ drawn from a discrete sampling space S. The $p(x_i)$ is the
probability that $x_i$ occurs.
In the case of genomes, S is the nucleotide alphabet of the DNA. When applied at the nucleotide level 
to genomic sequences, the sequence entropy for the full genome can be reduced to the question of 
CG-content [1,2]. This is a global property and related to the mutation rate, while the selective 
advantage reveals itself locally in, e.g., gene products such as regions coding for proteins. 

The paper is organized as follows: Section 2 discusses the concept of superinformation, and lays the
groundwork for entropy-based clustering. In Section 3, we propose a method for clustering genomic
sequences based on variations in local entropy. The conclusions are given in Section 4. 

2. Superinformation Revisited 

Shannon's entropy represents the information content of the data in the average sense and some of 
the features, e.g., the local variations due to selection in particular, are lost. This shortcoming 
motivated Bose et al. [3] to introduce the concept of superinformation. The entire genomic sequence is 
subdivided into N blocks, of length B nucleotides each. Depending on the local characteristics of the 
data, the blocks have different information content (i.e., measure for randomness). The $i$-th block has
entropy $H(X_i)$. By definition, $H(X_i)$ is a non-negative quantity. Then, a probability measure can be
derived by the following algorithm: 

(i) Construct the histogram of the $H(X_i)$ values, i.e.,

    $\{H_j(X_i, M)\} = \mathrm{Histogram}(\{H(X_i)\}), \quad i = 1, 2, \ldots, N$    (1)

where the Histogram function collects the elements of the vector $\{H(X_i)\}$ into M equally spaced bins and
returns the number of elements in each bin.

(ii) Form a probability measure by normalization:

    $p_j(X_i, M) = \frac{H_j(X_i, M)}{\sum_{k=1}^{M} H_k(X_i, M)}, \quad j = 1, 2, \ldots, M$    (2)

Then $p_j(X_i, M)$ is the relative frequency of the $H(X_i)$ values in the $j$-th bin.
Superinformation is then given by 

    $H_s(X, B, M) = -\sum_{j=1}^{M} p_j(X_i, M) \log_2 p_j(X_i, M)$    (3)

as a measure of the "entropy of entropy" and B defines the resolution at which this superinformation 
is calculated. 
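
The steps of Equations (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. The function names are hypothetical, and the histogram range is an assumption: bins span [0, 2] bits (the maximum entropy of a four-letter alphabet), so that distributions from different genomes share the same bins, whereas the text only specifies "M equally spaced bins". Dropping a trailing partial block is also an assumption.

```python
import math
from collections import Counter


def shannon_entropy(block):
    """Shannon entropy (in bits) of the symbol frequencies in one block."""
    counts = Counter(block)
    total = len(block)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def block_entropies(sequence, block_size):
    """Entropies H(X_i) of the N non-overlapping blocks of length B."""
    n_blocks = len(sequence) // block_size  # assumption: trailing partial block is dropped
    return [
        shannon_entropy(sequence[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]


def entropy_distribution(entropies, n_bins, h_max=2.0):
    """Equations (1)-(2): counts in M equal bins on [0, h_max], then normalized."""
    counts = [0] * n_bins
    for h in entropies:
        # Values at (or above) h_max are placed in the last bin.
        j = min(int(h / h_max * n_bins), n_bins - 1)
        counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts]


def superinformation(sequence, block_size, n_bins):
    """Equation (3): entropy of the distribution of block entropies."""
    p = entropy_distribution(block_entropies(sequence, block_size), n_bins)
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)
```

For a toy sequence in which half of the blocks are homopolymers (entropy 0) and half contain all four nucleotides equally (entropy 2 bits), the block entropies fall into two equally populated bins, so the superinformation is 1 bit.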

3. Clustering of Genomic Sequences 



In this section, we propose a method for clustering genomic sequences based on variations in local 
entropy. The analysis includes several viruses from the family of Phycodnaviridae as well as
eukaryotic organisms. The latter comprise higher and lower plants. The algae, which represent the
lower plants, are specific hosts of the viruses [4,5]. For the relationship between the lower and the 
higher plants, it is worth noting that the Chlorella species belong to the green algae and are hence 
considered ancestors of the higher plants. The brown alga Ectocarpus, on the other hand, belongs to
the Heterokontophyta, a group of algae that separated very early from the green algae [6].

Viruses in the family of the Phycodnaviridae are huge, icosahedral viruses with large 
double-stranded DNA genomes. They replicate in a host-specific manner in algae [4]. Virus EsV-1 has,
as a specific host, the brown alga Ectocarpus siliculosus. The viruses PBCV-1, NY2A and AR158 
replicate exclusively in the chlorella species C. NC64A. The hosts for the remaining viruses are closely 
related Chlorella species for which we have no sequence information yet [4]. The viruses considered 
here are very suitable for this analysis. The annotation of the 330 kbp genome of the prototype virus 
Paramecium bursaria chlorella virus 1 (PBCV-1) identified ca. 366 protein-encoding genes and 11 tRNA 
genes. More than half of the predicted gene products resemble proteins from pro- and eukaryotes with 
a known function [4]. It is interesting to note that these virus-encoded proteins are either the smallest
or among the smallest proteins of their class; some are so reduced that they represent no more
than the minimal catalytic unit. A further interesting feature of these viruses is that, unlike most
other viruses, they also have introns. Virus PBCV-1, for example, has three types of introns: a self-splicing
intron, a spliceosomal-processed intron, and a small tRNA intron.

Accumulating evidence indicates that the chlorella viruses have a very long evolutionary history 
possibly dating back to the time when eukaryotes arose from prokaryotes [7-9]. They are predicted to 
have a common ancestor with the poxviruses (e.g., vaccinia virus), asfarviruses, iridoviruses,
ascoviruses and mimiviruses [7,8]. Collectively, these viruses are referred to as nucleocytoplasmic 
large DNA viruses (NCLDVs). We now show that the local variations in entropy are not only very 
useful for clustering viruses and plant genomes, but may also suggest host specificity of viruses. 

We start with analyzing the superinformation content of viruses and plant genomes. The choice of 
block size B in Equation (3) was adjusted to stabilize the results for the superinformation, as shown in 
Figure 1. Intuitively, B defines the resolution at which superinformation is calculated. The figure 
shows the sensitivity of the probability density functions (pdfs) of sequence entropies with respect to 
block size B for the viral genome of PBCV1 and Chlorella NC64A. From Figure 1 we deduce that a 
choice of B = 100 yields a stable subsequent analysis. For B = 50 we obtained severe
fluctuations in the pdfs for both PBCV1 and Chlorella NC64A. For B = 100,
B = 150, and B = 200 we obtain more regular histograms and therefore more stable superinformation
values. However, the smaller the B, the better the resolution; we have therefore opted to use B = 100 in
the subsequent parts, as it provides stability and good resolution at the same time.

Similar sensitivities have been observed for the other seven plant viral genomes (AR158, ATCV,
CVM1, FR483, MT325, NY2A, TN603), Arabidopsis thaliana, Ectocarpus siliculosus and EsV-1.
Thus, we use this value of B in the subsequent parts of this study. It should be noted that the block size
(B = 100) is much smaller than the typical length of isochores (homogeneous domains) [10].
For example, in the Arabidopsis genome, the length of a GC isochore is of the order of 1 million base pairs [11].

Figure 1. Sensitivity of the probability density function (pdf) of sequence entropies with
respect to block size B for (a) the viral genome of PBCV1; and (b) Chlorella NC64A.
We also show the superinformation $H_s$.

In Table 1 we show the superinformation values for the genomic data of two plant hosts and their
respective viruses. These results suggest the existence of a distinct pattern in these genomic sequences,
implying differing selective pressures that shape the range of local variability (block entropies) and of
global variability (superinformation). To derive evolutionary distances between all the sequences, the
superinformation $H_s$ is, however, unsuitable. It should be noted that due to Chargaff's second parity
rule (G% ≈ C%, A% ≈ T%), plus the fact that GC% + AT% = 1, the four base compositions (three of which are
independent) in a block are reduced to one variable: GC%. So the entropy of a block has, more or less,
a one-to-one correspondence with the GC%, and the distribution of block entropies is similar to the
distribution of a function of GC%. Thus the superinformation for the whole sequence corresponds to
some measure of the window-GC% distribution (e.g., variance).
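
The correspondence between block entropy and GC% can be made explicit in an idealized case. As a sketch, assume that the parity rule holds exactly within a block (an assumption; in a block of 100 nucleotides it can hold only approximately), and let $g$ (a symbol introduced here for illustration) denote the GC fraction of the block, so that $p_G = p_C = g/2$ and $p_A = p_T = (1-g)/2$:

```latex
\begin{align*}
H(g) &= -2\,\frac{g}{2}\log_2\frac{g}{2} \;-\; 2\,\frac{1-g}{2}\log_2\frac{1-g}{2} \\
     &= -g\,(\log_2 g - 1) \;-\; (1-g)\,(\log_2(1-g) - 1) \\
     &= 1 - g\log_2 g - (1-g)\log_2(1-g).
\end{align*}
```

Under this assumption the block entropy equals 1 bit plus the binary entropy of $g$. It reaches its maximum of 2 bits at $g = 1/2$ and is symmetric about that point, so the mapping from GC% to entropy is one-to-one only on either side of $g = 1/2$, which fits the "more or less" qualification.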

Table 1. Superinformation values $H_s$ for various genomic sequences. Note the tendency of
increased $H_s$ for organisms in comparison to viruses.

Genomic Sequence          H_s [Bit]
AR158                     4.21
ATCV                      3.62
CVM1                      3.87
FR483                     3.83
MT325                     3.77
NY2A                      4.22
PBCV1                     4.38
TN603                     3.65
EsV-1                     3.54
Arabidopsis thaliana      5.27
Chlorella NC64A           5.10
Oryza sativa              4.74
Ectocarpus siliculosus    4.10

Continuing with the rationale above that the distribution of local sequence entropies is a signal for
evolutionary pressure of the environment, we decided to take one step back and use these distributions
of block entropies $p_j(X_i, M)$ instead. These distributions describe, in detail, the local variability of the
sequences, and thus any distance between such $p_j(X_i, M)$ clusters entities based on the variability of
entropies. The distributions of the block entropies of the genomic sequences are shown in Figure 2.
Upon visual inspection, there appears to be a distinct pattern for viruses and plant genomes. The
distributions for plant genomes show a larger variance as compared to that of the viruses. This
motivates us to explore clustering based on the distributions of block entropies. We note that
characterization of sequences based on the distribution of their sub-sequences has been explored
earlier [12-14].

Figure 2. Plot of the probability density function (pdf) of the entropy for 8 plant viral 
genomes (AR158, ATCV, CVM1, FR483, MT325, NY2A, PBCV1, TN603), Arabidopsis
thaliana, Chlorella NC64A, Ectocarpus siliculosus and EsV-1. The plots were obtained for
a block size of B = 100.

We can quantify the differences in the respective $p_j(X_i, M)$ by generalized Kullback-Leibler
divergences [15], the Jensen-Shannon distances in particular [16], as discussed in [17,18]. The
Jensen-Shannon divergence $D_{JS}(p, q)$ between the entropy distributions $p_j(X_i, M)$ and $q_j(X_i, M)$ for
two data sources is given by

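    $D_{JS}(p, q) = \frac{1}{2} D_{KL}(p \| m) + \frac{1}{2} D_{KL}(q \| m)$

    $D_{KL}(p \| m) = \sum_{j=1}^{M} p_j(X_i, M) \log_2 \frac{p_j(X_i, M)}{m_j(X_i, M)}$

    $D_{KL}(q \| m) = \sum_{j=1}^{M} q_j(X_i, M) \log_2 \frac{q_j(X_i, M)}{m_j(X_i, M)}$    (4)

    $m_j(X_i, M) = \frac{1}{2} \left( p_j(X_i, M) + q_j(X_i, M) \right)$

    $d(p, q) = \sqrt{D_{JS}(p, q)}$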

Here, $m_j(X_i, M)$ is an "intermediate" distribution, and $D_{KL}(p \| m)$ and $D_{KL}(q \| m)$ are the
Kullback-Leibler divergences between p and m and between q and m, respectively. A suitable metric for
clustering is then $d(p, q)$ [16]. The Jensen-Shannon distance has also been applied to DNA by
previous researchers [19].
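
As an illustration of Equation (4), the following Python sketch computes the Jensen-Shannon divergence and the derived distance for two discrete distributions given as equal-length lists of bin probabilities. The function names are hypothetical. The convention that bins with zero probability contribute nothing to the sum is the usual one, but it is stated here as an assumption because the text does not spell it out.

```python
import math


def kl_divergence(p, m):
    """Kullback-Leibler divergence D_KL(p || m) in bits.

    Bins with p_j = 0 contribute 0 (assumed convention).
    """
    return sum(pj * math.log2(pj / mj) for pj, mj in zip(p, m) if pj > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence D_JS(p, q) with the intermediate distribution m."""
    m = [0.5 * (pj + qj) for pj, qj in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Distance d(p, q) = sqrt(D_JS(p, q)); clamped at 0 against rounding error."""
    return math.sqrt(max(js_divergence(p, q), 0.0))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no common bin), so the distance also lies between 0 and 1. Because m is nonzero wherever p or q is nonzero, the logarithm is always defined.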

We first computed the distances $d(p, q)$ between all viral genomes that infect chlorella algae. We then
used distance-matrix based clustering [20] as implemented in the statistical software R (R2009).
In Figure 3, we clearly see that the genomic variability is able to reflect the host specificity of the 
viruses. The viruses that have a common host are closer to one another in terms of their block-entropy 
distributions. For example, the viruses FR483, CVM1 and MT325 that specifically infect Chlorella 
Pbi, are clustered together. The viruses PBCV-1, NY2A and AR158 that replicate exclusively in the 
chlorella species C. NC64A, are clustered together. This is a valuable finding and probably suggests 
that the respective host environment gives rise to unique selective pressures. It should be noted that 
(i) MT325 and FR483 are strongly related genomes sharing 94% of their genes with an average 86% 
amino-acid sequence identity and an almost identical gene order [21] and that (ii) both of these 
genomes are related to PBCV-1, but to a lesser extent (82%, 50%, weak gene order conservation). 
Hence, the identification of MT325 and FR483 (infecting the same host) to be more closely related 
than they are to PBCV-1 would be captured by sequence alignment tools. 
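
The clustering step operates only on the matrix of pairwise distances. The sketch below shows one simple way to cluster from such a matrix, using average-linkage agglomerative clustering (UPGMA) in plain Python. This is a stand-in chosen for illustration: the text does not state which algorithm of [20] or which R function was used, and the function name, the input format, and the toy labels and distances in the usage note are all hypothetical.

```python
def average_linkage(names, dist):
    """Agglomerative clustering with average linkage on pairwise distances.

    names: list of unique labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the merges in order as (cluster_a, cluster_b, height) tuples,
    where each cluster is a tuple of labels.
    """
    clusters = [(name,) for name in names]
    merges = []

    def cluster_distance(a, b):
        # Mean pairwise distance between the members of the two clusters.
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Evaluate every pair of current clusters and merge the closest one.
        pairs = [
            (cluster_distance(a, b), a, b)
            for i, a in enumerate(clusters)
            for b in clusters[i + 1:]
        ]
        height, a, b = min(pairs, key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges
```

For example, with three made-up sequences where `dist = {frozenset(("v1", "v2")): 0.1, frozenset(("v1", "h1")): 0.5, frozenset(("v2", "h1")): 0.6}`, calling `average_linkage(["v1", "v2", "h1"], dist)` first joins `v1` and `v2`, and then attaches `h1` at the mean of its two distances to that pair.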

Figure 3. Clustering of viral genomes, annotated by their respective hosts, in this case
variants of the Chlorella algae. The branch lengths are proportional to the phylogenetic
Jensen-Shannon distance of Equation (4); the scale is indicated at the bottom.

These findings motivated us to augment the data set by including: (a) the host genome itself
(Chlorella NC64A, [22]); (b) an independent host-virus genome pair (Ectocarpus siliculosus, [5]); and
(c) other plant genomes, of which only a few are available to date. The plant genomes that we
have included in our study belong to Populus trichocarpa (black cottonwood), Oryza sativa (Asian 
rice) and Arabidopsis (a small flowering plant). This set-up allows us to investigate whether the 
evolution of the host genomes is subject to the same evolutionary dynamics, leading to similar 
variabilities in the local entropies, as is the case for their respective viruses. By this, we can judge 
whether there exists differential evolutionary pressure on host and pathogenic genomes. Figure 4 
shows the results of this experiment. 

Figure 4. Clustering by the entropy-variability distance of Equation (4) for the extended
genomic sequence set. We show the distance matrix $d(p, q)$ for every pairing (p, q) of
genomic sequences (center plane, red = small values, green = high values). The bar on the
left indicates whether the sequence is of viral origin (blue) or a living organism (orange).
Note that the tree is not rooted.




We see a clear separation of the genomic sequences when we cluster based on local entropy of the 
genomic data. In particular, there is a clear separation between viral and host genomes. This is a very 
interesting result and shows that clustering based on local entropy is able to assign the plants and 
viruses to different groups. The green alga and the higher plants form a sub-clade, while the plant 
viruses form a separate one. This comes out very clearly in Figure 4. The other interesting observation 
is that the brown alga (Ectocarpus siliculosus) is separate from the green alga (Chlorella) and the
higher plants (Populus trichocarpa, Oryza sativa and Arabidopsis). This is consistent with the fact that
these plants separated very early in evolution [6]. The fact that the genome of Ectocarpus siliculosus
includes the entire genome of the respective virus EsV-1 [5] may contribute to this separate position.



Viruses 2014, 6 




Interestingly, the cluster formed by the viruses infecting Chlorella subspecies includes the Ectocarpus
siliculosus virus; this occurs even though the viruses have very different lifestyles [4].

4. Conclusions 

In this paper, we have proposed a novel method for clustering genomic sequences based on 
variations in local entropy. The clustering of the genomes on the basis of the Jensen-Shannon distances 
clearly brings out the host specificity of the viruses, i.e., the viruses that have a common host are closer 
to one another in terms of their block-entropy distributions. The proposed entropy-based technique is 
also able to segregate plant and virus genomes into separate bins. Our clustering technique also
resulted in the brown alga being separate from the green alga and the higher plants, which is consistent
with the fact that these plants separated very early in the process of evolution.